Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
We propose a self-supervised method to learn feature representations from
videos. A standard approach in traditional self-supervised methods uses
positive-negative data pairs to train with a contrastive learning strategy. In
such a case, different modalities of the same video are treated as positives
and video clips from a different video are treated as negatives. Because
spatio-temporal information is important for video representation, we extend
the negative samples by introducing intra-negative samples, which are
transformed from the same anchor video by breaking temporal relations in video
clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train
spatio-temporal convolutional networks to learn video representations. There
are many flexible options in our IIC framework and we conduct experiments by
using several different configurations. Evaluations are conducted on video
retrieval and video recognition tasks using the learned video representation.
Our proposed IIC outperforms current state-of-the-art results by a large
margin, e.g., improvements of 16.7 and 9.5 percentage points in top-1 accuracy
on the UCF101 and HMDB51 datasets for video retrieval, respectively. For video recognition,
improvements can also be obtained on these two benchmark datasets. Code is
available at
https://github.com/BestJuly/Inter-intra-video-contrastive-learning.

Comment: Accepted by ACMMM 2020. Our project page is at
https://bestjuly.github.io/Inter-intra-video-contrastive-learning
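To make the intra-negative idea concrete, here is a minimal PyTorch sketch of how a temporally shuffled clip can serve as an extra negative in an InfoNCE-style loss. The helper names (make_intra_negative, inter_intra_nce), the clip tensor layout, and the temperature value are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def make_intra_negative(clip: torch.Tensor) -> torch.Tensor:
    """Break the temporal order of a clip shaped (C, T, H, W).

    The shuffled clip keeps the anchor's appearance but destroys its
    temporal structure, so it can act as an intra-negative sample.
    """
    perm = torch.randperm(clip.shape[1])  # permute the frame (T) axis
    return clip[:, perm]

def inter_intra_nce(anchor, positive, inter_negs, intra_negs, tau=0.07):
    """InfoNCE-style loss over both inter- and intra-negatives.

    anchor, positive: (D,) embeddings of two modalities of the same video.
    inter_negs: (N, D) embeddings of clips from other videos.
    intra_negs: (M, D) embeddings of shuffled versions of the anchor clip.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(torch.cat([inter_negs, intra_negs]), dim=-1)
    pos = torch.exp(anchor @ positive / tau)
    neg = torch.exp(anchor @ negatives.T / tau).sum()
    return -torch.log(pos / (pos + neg))
```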
Intent Identification and Entity Extraction for Healthcare Queries in Indic Languages
Scarcity of data and technological limitations for resource-poor languages in
developing countries like India pose a threat to the development of
sophisticated NLU systems for healthcare. To assess the current status of
various state-of-the-art language models in healthcare, this paper studies the
problem by first proposing two healthcare datasets, Indian Healthcare Query
Intent-WebMD and Indian Healthcare Query Intent-1mg (IHQID-WebMD and
IHQID-1mg), and one real-world Indian hospital query dataset in English and
multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati),
all annotated with query intents as well as entities. Our aim is to detect query intents and
extract corresponding entities. We perform extensive experiments on a set of
models in various realistic settings and explore two scenarios based on access
to English data only (less costly) and access to target-language data
(more expensive). We analyze context-specific practical relevance through
empirical analysis. The results, expressed in terms of overall F1 score, show
that our approach is practically useful for identifying intents and entities.
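As a rough illustration of the cheaper English-only scenario, the sketch below fine-tunes a multilingual encoder for intent classification and then applies it unchanged to queries in Indic scripts. The model choice (xlm-roberta-base) and the intent label set are assumptions for illustration, not the paper's actual setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical intent inventory; the paper's label set may differ.
INTENTS = ["symptom_query", "medicine_query", "appointment", "other"]

# A multilingual encoder lets a model fine-tuned on English-only data be
# applied directly to Indic-language queries (the less costly scenario).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(INTENTS)
)  # classification head is untrained here; fine-tune on labelled queries first

def predict_intent(query: str) -> str:
    """Map a healthcare query (in any supported script) to an intent label."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return INTENTS[int(logits.argmax(dim=-1))]
```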
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Video understanding tasks take many forms, from action detection to visual
query localization and spatio-temporal grounding of sentences. These tasks
differ in the type of inputs (video only, or a video-query pair where the query
is an image region or a sentence) and outputs (temporal segments or spatio-temporal
tubes). However, at their core they require the same fundamental understanding
of the video, i.e., the actors and objects in it, their actions and
interactions. So far these tasks have been tackled in isolation with
individual, highly specialized architectures, which do not exploit the
interplay between tasks. In contrast, in this paper, we present a single,
unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic
Memory benchmark, which entail queries of three different forms: given an
egocentric video and a visual, textual or activity query, the goal is to
determine when and where the answer can be seen within the video. Our model
design is inspired by recent query-based approaches to spatio-temporal
grounding, and contains modality-specific query encoders and task-specific
sliding-window inference that allow multi-task training with diverse input
modalities and different structured outputs. We exhaustively analyze
relationships among the tasks and illustrate that cross-task learning leads to
improved performance on each individual task, as well as the ability to
generalize to unseen tasks, such as zero-shot spatial localization of language
queries.
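For intuition, here is a minimal sketch of task-specific sliding-window inference over a long video, assuming precomputed per-frame features and a query embedding produced by some modality-specific encoder. The scoring function, window size, and stride are illustrative stand-ins, not the model's actual design.

```python
import torch
import torch.nn.functional as F

def sliding_window_inference(video_feats, query_emb, score_fn,
                             window=64, stride=32):
    """Return the (start, end) frame span best matching the query.

    video_feats: (T, D) per-frame features of the full long-form video.
    query_emb:   (D,) output of a modality-specific query encoder
                 (visual crop, sentence, or activity name).
    score_fn:    callable (window_feats, query_emb) -> scalar relevance.
    """
    best, best_span = float("-inf"), (0, 0)
    for start in range(0, max(1, video_feats.shape[0] - window + 1), stride):
        span = video_feats[start:start + window]
        score = float(score_fn(span, query_emb))
        if score > best:
            best, best_span = score, (start, start + span.shape[0])
    return best_span

def cosine_score(span, q):
    """Toy relevance: mean cosine similarity between frames and the query."""
    return (F.normalize(span, dim=-1) @ F.normalize(q, dim=-1)).mean()
```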
Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning
Modern deep learning requires large-scale extensively labelled datasets for
training. Few-shot learning aims to alleviate this issue by learning
effectively from few labelled examples. In previously proposed few-shot visual
classifiers, it is assumed that the feature manifold, where classifier
decisions are made, has uncorrelated feature dimensions and uniform feature
variance. In this work, we focus on addressing the limitations arising from
this assumption by proposing a variance-sensitive class of models that operates
in a low-label regime. The first method, Simple CNAPS, employs a hierarchically
regularized Mahalanobis-distance-based classifier combined with a
state-of-the-art neural adaptive feature extractor to achieve strong performance on
Meta-Dataset, mini-ImageNet and tiered-ImageNet benchmarks. We further extend
this approach to a transductive learning setting, proposing Transductive CNAPS.
This transductive method combines a soft k-means parameter refinement procedure
with a two-step task encoder to achieve improved test-time classification
accuracy using unlabelled data. Transductive CNAPS achieves state-of-the-art
performance on Meta-Dataset. Finally, we explore the use of our methods (Simple
and Transductive) for "out of the box" continual and active learning. Extensive
experiments on large-scale benchmarks illustrate the robustness and versatility
of this relatively simple class of models. All trained model checkpoints and
corresponding source code have been made publicly available.
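For concreteness, below is a simplified sketch of a regularized Mahalanobis-distance classifier in the spirit of Simple CNAPS. The fixed blending weight `lam` and the ridge term are assumptions that stand in for the paper's hierarchical, class-size-dependent regularization; this is not the released implementation.

```python
import torch

def mahalanobis_classify(support, labels, query, n_classes, lam=0.5):
    """Label queries by Mahalanobis distance to class means.

    support: (N, D) adapted embeddings of the labelled support set.
    query:   (M, D) embeddings to classify.
    Each class covariance is shrunk toward the task-level covariance; the
    fixed weight `lam` simplifies the paper's hierarchical regularization.
    """
    d = support.shape[1]
    task_cov = torch.cov(support.T) + 0.1 * torch.eye(d)  # ridge for stability
    logits = []
    for c in range(n_classes):
        xc = support[labels == c]
        mu = xc.mean(dim=0)
        class_cov = torch.cov(xc.T) if xc.shape[0] > 1 else torch.zeros(d, d)
        prec = torch.linalg.inv(lam * class_cov + (1 - lam) * task_cov)
        diff = query - mu
        logits.append(-(diff @ prec * diff).sum(dim=-1))  # neg. sq. distance
    return torch.stack(logits, dim=1).argmax(dim=1)  # (M,) predicted classes
```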
Evaluation of Interactive Rhythm Activities on the Engagement Level of Individuals with Memory Impairments
Alzheimer's dementia can lead to a decreased quality of life in patients through the manifestation of inappropriate behavioral and psychological signs and symptoms. Music therapy has been shown to decrease agitation and disruptive behaviors in patients with dementia, although improvement in overall cognitive function was minimal. However, there is evidence showing an increase in grey matter in those who actively participate in music activities. Our goal in this study is to focus on how participation in rhythm-based activities affects quality of life.